Although much of the functionality of the *apply family is covered by the extremely popular plyr package, the base functions remain useful and worth knowing.
apply - When you want to apply a function to the rows or columns of a matrix (and higher-dimensional analogues); not generally advisable for data frames as it will coerce to a matrix first.
# Two dimensional matrix
M <- matrix(seq(1,16), 4, 4)
# apply min to rows
apply(M, 1, min)
[1] 1 2 3 4
# apply max to columns
apply(M, 2, max)
[1] 4 8 12 16
# 3 dimensional array
M <- array( seq(32), dim = c(4,4,2))
# Apply sum across each M[*, , ] - i.e. sum across 2nd and 3rd dimension
apply(M, 1, sum)
# Result is one-dimensional
[1] 120 128 136 144
# Apply sum across each M[*, *, ] - i.e. sum across 3rd dimension
apply(M, c(1,2), sum)
# Result is two-dimensional
[,1] [,2] [,3] [,4]
[1,] 18 26 34 42
[2,] 20 28 36 44
[3,] 22 30 38 46
[4,] 24 32 40 48
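The data frame caveat mentioned at the top of this section can be made concrete with a small sketch. The data frame df below is made up for illustration and is not part of the original examples. Because apply converts df to a matrix first, a single character column is enough to turn every row into character strings, whereas iterating over the columns keeps each column's own type.

```r
# Hypothetical data frame with one integer and one character column
df <- data.frame(n = 1:2, s = c("a", "b"), stringsAsFactors = FALSE)

# apply() calls as.matrix(df) first, so each row arrives as a character vector
row_classes <- apply(df, 1, function(row) class(row))

# Column-wise iteration treats the data frame as a list and keeps each type
col_classes <- sapply(df, class)

print(row_classes)
print(col_classes)
```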
If you want row/column means or sums for a 2D matrix, be sure to investigate the highly optimized, lightning-quick colMeans, rowMeans, colSums, rowSums.
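As a quick sanity check, the sketch below compares apply with the specialised functions on the 4 x 4 matrix used earlier. The variable names are only illustrative. It demonstrates that the results agree, not how much faster the specialised versions are.

```r
M <- matrix(seq(1, 16), 4, 4)

# Row sums and column means via apply()
row_totals_apply <- apply(M, 1, sum)
col_means_apply  <- apply(M, 2, mean)

# The same quantities via the specialised functions
row_totals_fast <- rowSums(M)
col_means_fast  <- colMeans(M)

# Both comparisons are expected to print TRUE
print(isTRUE(all.equal(row_totals_apply, row_totals_fast)))
print(isTRUE(all.equal(col_means_apply, col_means_fast)))
```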
lapply - When you want to apply a function to each element of a list in turn and get a list back.
This is the workhorse of many of the other *apply functions. Peel back their code and you will often find lapply underneath.
x <- list(a = 1, b = 1:3, c = 10:100)
lapply(x, FUN = length)
$a
[1] 1
$b
[1] 3
$c
[1] 91
lapply(x, FUN = sum)
$a
[1] 1
$b
[1] 6
$c
[1] 5005
sapply - When you want to apply a function to each element of a list in turn, but you want a vector back, rather than a list.
If you find yourself typing unlist(lapply(...)), stop and consider sapply.
x <- list(a = 1, b = 1:3, c = 10:100)
#Compare with above; a named vector, not a list
sapply(x, FUN = length)
a b c
1 3 91
sapply(x, FUN = sum)
a b c
1 6 5005
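For this simple case, where every call returns a single value, the two spellings should give the same named vector. A minimal check, reusing the list x from above:

```r
x <- list(a = 1, b = 1:3, c = 10:100)

# The long-hand version and the sapply shortcut
via_unlist <- unlist(lapply(x, FUN = length))
via_sapply <- sapply(x, FUN = length)

print(identical(via_unlist, via_sapply))
```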
In more advanced uses of sapply, it will attempt to coerce the result to a multi-dimensional array, if appropriate. For example, if our function returns vectors of the same length, sapply will use them as columns of a matrix:
sapply(1:5,function(x) rnorm(3,x))
If our function returns a 2 dimensional matrix, sapply will do essentially the same thing, treating each returned matrix as a single long vector:
sapply(1:5,function(x) matrix(x,2,2))
That is, unless we specify simplify = "array", in which case it will use the individual matrices to build a multi-dimensional array:
sapply(1:5,function(x) matrix(x,2,2), simplify = "array")
Each of these behaviors is of course contingent on our function returning vectors or matrices of the same length or dimension.
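The shapes described above can be checked with dim. This sketch swaps rnorm for deterministic functions so that the results are reproducible. The last call is an extra illustration of what happens when the lengths differ: sapply cannot simplify and hands back a list.

```r
# Equal-length vectors become the columns of a matrix (3 rows, 5 columns)
m <- sapply(1:5, function(x) rep(x, 3))
print(dim(m))

# 2 x 2 matrices are flattened into length-4 columns by default
flat <- sapply(1:5, function(x) matrix(x, 2, 2))
print(dim(flat))

# With simplify = "array" the matrices are stacked along a third dimension
arr <- sapply(1:5, function(x) matrix(x, 2, 2), simplify = "array")
print(dim(arr))

# Results of differing lengths cannot be simplified, so a list comes back
ragged <- sapply(1:3, function(x) seq_len(x))
print(is.list(ragged))
```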
vapply - When you want to use sapply but perhaps need to squeeze some more speed out of your code.
For vapply, you basically give R an example of what sort of thing your function will return, which can save some time coercing returned values to fit in a single atomic vector.
x <- list(a = 1, b = 1:3, c = 10:100)
#Note that since the advantage here is mainly speed, this
# example is only for illustration. We're telling R that
# everything returned by length() should be an integer of
# length 1.
vapply(x, FUN = length, FUN.VALUE = 0L)
a b c
1 3 91
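Besides speed, declaring FUN.VALUE also makes vapply stricter than sapply, which is worth knowing even though the text above does not dwell on it. The sketch below shows two consequences: a mismatching declaration stops with an error, and an empty input still yields a vector of the declared type. The variable names are only illustrative.

```r
x <- list(a = 1, b = 1:3, c = 10:100)

# Declaring the wrong return type makes vapply stop with an error,
# which is caught here and replaced by a marker string
wrong_type <- tryCatch(
  vapply(x, FUN = length, FUN.VALUE = character(1)),
  error = function(e) "error"
)
print(wrong_type)

# On empty input, sapply falls back to an empty list, whereas
# vapply still returns a (zero-length) vector of the declared type
empty_sapply <- sapply(list(), length)
empty_vapply <- vapply(list(), length, FUN.VALUE = integer(1))
print(class(empty_sapply))
print(class(empty_vapply))
```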
mapply - For when you have several data structures (e.g. vectors, lists) and you want to apply a function to the 1st elements of each, and then the 2nd elements of each, etc., coercing the result to a vector/array as in sapply.
This is multivariate in the sense that your function must accept multiple arguments.
#Sums the 1st elements, the 2nd elements, etc.
mapply(sum, 1:5, 1:5, 1:5)
[1] 3 6 9 12 15
#To do rep(1,4), rep(2,3), etc.
mapply(rep, 1:4, 4:1)
[[1]]
[1] 1 1 1 1
[[2]]
[1] 2 2 2
[[3]]
[1] 3 3
[[4]]
[1] 4
Map - A wrapper to mapply with SIMPLIFY = FALSE, so it is guaranteed to return a list.
Map(sum, 1:5, 1:5, 1:5)
[[1]]
[1] 3
[[2]]
[1] 6
[[3]]
[1] 9
[[4]]
[1] 12
[[5]]
[1] 15
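The relationship between the two can be checked directly. In this sketch the Map call and the explicit mapply call with SIMPLIFY = FALSE are expected to be identical:

```r
via_Map    <- Map(sum, 1:5, 1:5, 1:5)
via_mapply <- mapply(sum, 1:5, 1:5, 1:5, SIMPLIFY = FALSE)

# Same list either way
print(identical(via_Map, via_mapply))

# With SIMPLIFY left at its default, the same call collapses to a vector
simplified <- mapply(sum, 1:5, 1:5, 1:5)
print(simplified)
```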
rapply - For when you want to apply a function to each element of a nested list structure, recursively.
To give you some idea of how uncommon rapply is, I forgot about it when first posting this answer! Obviously, I'm sure many people use it, but YMMV. rapply is best illustrated with a user-defined function to apply:
#Append ! to string, otherwise increment
myFun <- function(x){
    if (is.character(x)){
        return(paste(x,"!",sep=""))
    }
    else{
        return(x + 1)
    }
}
#A nested list structure
l <- list(a = list(a1 = "Boo", b1 = 2, c1 = "Eeek"),
          b = 3, c = "Yikes",
          d = list(a2 = 1, b2 = list(a3 = "Hey", b3 = 5)))
#Result is named vector, coerced to character
rapply(l,myFun)
#Result is a nested list like l, with values altered
rapply(l, myFun, how = "replace")
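An alternative to branching on the type inside the function is rapply's classes argument, which restricts the function to leaves of a given class. The sketch below is a variation on the example above, not a replacement for it. With how = "replace", the leaves that do not match are left untouched.

```r
l <- list(a = list(a1 = "Boo", b1 = 2, c1 = "Eeek"),
          b = 3, c = "Yikes",
          d = list(a2 = 1, b2 = list(a3 = "Hey", b3 = 5)))

# Increment only the numeric leaves; character leaves stay as they are
incremented <- rapply(l, function(x) x + 1, classes = "numeric", how = "replace")

print(incremented$b)        # was 3
print(incremented$a$a1)     # character leaf, unchanged
print(incremented$d$b2$b3)  # was 5
```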
tapply - For when you want to apply a function to subsets of a vector and the subsets are defined by some other vector, usually a factor.
The black sheep of the *apply family, of sorts. The help file's use of the phrase "ragged array" can be a bit confusing, but it is actually quite simple.
A vector:
x <- 1:20
A factor (of the same length!) defining groups:
y <- factor(rep(letters[1:5], each = 4))
Add up the values in x within each subgroup defined by y:
tapply(x, y, sum)
a b c d e
10 26 42 58 74
More complex examples can be handled where the subgroups are defined by the unique combinations of a list of several factors. tapply is similar in spirit to the split-apply-combine functions that are common in R (aggregate, by, ave, ddply, etc.). Hence its black sheep status.
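As a sketch of the several-factors case, the example below adds a second, made-up grouping factor z to the x and y used above. Passing a list of factors gives one cell per combination of levels, so the result is a matrix rather than a vector.

```r
x <- 1:20
y <- factor(rep(letters[1:5], each = 4))
# Hypothetical second grouping factor, alternating along x
z <- factor(rep(c("odd", "even"), times = 10))

# One sum per (y, z) combination: rows follow y, columns follow z
totals <- tapply(x, list(y, z), sum)
print(totals)
```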
Mnemonics
lapply is a list apply which acts on a list or vector and returns a list.
sapply is a simple lapply (function defaults to returning a vector or matrix when possible)
vapply is a verified apply (allows the return object type to be prespecified)
rapply is a recursive apply for nested lists, i.e. lists within lists
tapply is a tagged apply where the tags identify the subsets
apply is generic: applies a function to a matrix's rows or columns

The by function, as stated in the documentation, can be thought of as a "wrapper" for tapply. The power of by arises when we want to compute a task that tapply can't handle. One example is this code:
ct <- tapply(iris$Sepal.Width , iris$Species , summary )
cb <- by(iris$Sepal.Width , iris$Species , summary )
cb
iris$Species: setosa
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.300 3.200 3.400 3.428 3.675 4.400
--------------------------------------------------------------
iris$Species: versicolor
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.000 2.525 2.800 2.770 3.000 3.400
--------------------------------------------------------------
iris$Species: virginica
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.200 2.800 3.000 2.974 3.175 3.800
ct
$setosa
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.300 3.200 3.400 3.428 3.675 4.400
$versicolor
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.000 2.525 2.800 2.770 3.000 3.400
$virginica
Min. 1st Qu. Median Mean 3rd Qu. Max.
2.200 2.800 3.000 2.974 3.175 3.800
If we print these two objects, ct and cb, we "essentially" have the same results; the only differences are in how they are shown and in their class attributes: by for cb and array for ct.
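The class difference can be inspected directly. A minimal sketch using the same two calls:

```r
ct <- tapply(iris$Sepal.Width, iris$Species, summary)
cb <- by(iris$Sepal.Width, iris$Species, summary)

# tapply returns a plain array; by tags its result with class "by",
# which is what triggers the different print layout
print(class(ct))
print(class(cb))
```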
As I've said, the power of by arises when we can't use tapply; the following code is one example:
tapply(iris, iris$Species, summary )
Error in tapply(iris, iris$Species, summary) :
arguments must have same length
R says that the arguments must have the same length. In effect we are asking it to "calculate the summary of every variable in iris along the factor Species", but R just can't do that here because tapply does not know how to handle a data frame as its first argument.
With the by function, R dispatches a specific method for the data frame class and then lets the summary function work even though the length of the first argument (and its type too) is different.
bywork <- by(iris, iris$Species, summary )
bywork
iris$Species: setosa
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4.300 Min. :2.300 Min. :1.000 Min. :0.100 setosa :50
1st Qu.:4.800 1st Qu.:3.200 1st Qu.:1.400 1st Qu.:0.200 versicolor: 0
Median :5.000 Median :3.400 Median :1.500 Median :0.200 virginica : 0
Mean :5.006 Mean :3.428 Mean :1.462 Mean :0.246
3rd Qu.:5.200 3rd Qu.:3.675 3rd Qu.:1.575 3rd Qu.:0.300
Max. :5.800 Max. :4.400 Max. :1.900 Max. :0.600
--------------------------------------------------------------
iris$Species: versicolor
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4.900 Min. :2.000 Min. :3.00 Min. :1.000 setosa : 0
1st Qu.:5.600 1st Qu.:2.525 1st Qu.:4.00 1st Qu.:1.200 versicolor:50
Median :5.900 Median :2.800 Median :4.35 Median :1.300 virginica : 0
Mean :5.936 Mean :2.770 Mean :4.26 Mean :1.326
3rd Qu.:6.300 3rd Qu.:3.000 3rd Qu.:4.60 3rd Qu.:1.500
Max. :7.000 Max. :3.400 Max. :5.10 Max. :1.800
--------------------------------------------------------------
iris$Species: virginica
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
Min. :4.900 Min. :2.200 Min. :4.500 Min. :1.400 setosa : 0
1st Qu.:6.225 1st Qu.:2.800 1st Qu.:5.100 1st Qu.:1.800 versicolor: 0
Median :6.500 Median :3.000 Median :5.550 Median :2.000 virginica :50
Mean :6.588 Mean :2.974 Mean :5.552 Mean :2.026
3rd Qu.:6.900 3rd Qu.:3.175 3rd Qu.:5.875 3rd Qu.:2.300
Max. :7.900 Max. :3.800 Max. :6.900 Max. :2.500
It works indeed, and the result is very surprising. It is an object of class by that, along Species (say, for each of them), computes the summary of each variable.
Note that if the first argument is a data frame, the dispatched function must have a method for that class of objects. For example, if we use this code with the mean function, we get a result that makes no sense at all:
by(iris, iris$Species, mean)
iris$Species: setosa
[1] NA
-------------------------------------------
iris$Species: versicolor
[1] NA
-------------------------------------------
iris$Species: virginica
[1] NA
Warning messages:
1: In mean.default(data[x, , drop = FALSE], ...) :
argument is not numeric or logical: returning NA
2: In mean.default(data[x, , drop = FALSE], ...) :
argument is not numeric or logical: returning NA
3: In mean.default(data[x, , drop = FALSE], ...) :
argument is not numeric or logical: returning NA
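One possible workaround, which is not part of the original answer, is to pass a function that does accept a data frame. colMeans is such a function as long as every column is numeric, so the sketch below drops the Species factor (the fifth column of iris) before calling by.

```r
# Keep only the four numeric columns; colMeans cannot average a factor
species_means <- by(iris[, 1:4], iris$Species, colMeans)
print(species_means)

# First group (setosa, the first factor level): mean Sepal.Length
setosa_sepal_length <- species_means[[1]][["Sepal.Length"]]
print(setosa_sepal_length)
```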
aggregate can be seen as a different way of using tapply if we use it in such a way.
at <- tapply(iris$Sepal.Length , iris$Species , mean)
ag <- aggregate(iris$Sepal.Length , list(iris$Species), mean)
at
setosa versicolor virginica
5.006 5.936 6.588
ag
Group.1 x
1 setosa 5.006
2 versicolor 5.936
3 virginica 6.588
The two immediate differences are that the second argument of aggregate must be a list, while for tapply it can (but need not) be a list, and that the output of aggregate is a data frame while that of tapply is an array.
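Both differences can be verified in a few lines. The sketch below also uses a named list element, which is one way to get a more descriptive column name than Group.1; the name Species is simply chosen here for illustration.

```r
at <- tapply(iris$Sepal.Length, iris$Species, mean)
ag <- aggregate(iris$Sepal.Length, list(iris$Species), mean)

print(is.array(at))       # tapply gives a named 1-d array
print(is.data.frame(ag))  # aggregate gives a data frame

# Naming the list element names the grouping column
ag_named <- aggregate(iris$Sepal.Length, list(Species = iris$Species), mean)
print(names(ag_named))

# The numbers themselves agree
print(isTRUE(all.equal(as.numeric(at), ag$x)))
```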
The power of aggregate is that it can easily handle subsets of the data with the subset argument and that it has methods for ts objects and formulas as well.
These elements make aggregate easier to work with than tapply in some situations.
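As a small sketch of the subset argument, which belongs to the formula method, the call below restricts the built-in ToothGrowth data to the lowest dose before averaging. The choice of dose == 0.5 is only for illustration.

```r
# Mean tooth length per supplement, restricted to the lowest dose
ag_low <- aggregate(len ~ supp, data = ToothGrowth, FUN = mean,
                    subset = dose == 0.5)
print(ag_low)
```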
Here are some examples (available in documentation):
ag <- aggregate(len ~ ., data = ToothGrowth, mean)
ag
supp dose len
1 OJ 0.5 13.23
2 VC 0.5 7.98
3 OJ 1.0 22.70
4 VC 1.0 16.77
5 OJ 2.0 26.06
6 VC 2.0 26.14
We can achieve the same with tapply, but the syntax is slightly harder and the output (in some circumstances) less readable:
att <- tapply(ToothGrowth$len, list(ToothGrowth$dose, ToothGrowth$supp), mean)
att
OJ VC
0.5 13.23 7.98
1 22.70 16.77
2 26.06 26.14
There are other times when we can't use by or tapply and we have to use aggregate.
ag1 <- aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, mean)
ag1
Month Ozone Temp
1 5 23.61538 66.73077
2 6 29.44444 78.22222
3 7 59.11538 83.88462
4 8 59.96154 83.96154
5 9 31.44828 76.89655
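We cannot obtain the previous result with tapply in one call; we have to calculate the mean along Month for each element and then combine them. Also note that we have to pass na.rm = TRUE, because the formula method of the aggregate function uses na.action = na.omit by default. Even so, the Temp means differ slightly from those of aggregate, since na.omit drops every row in which Ozone or Temp is missing, whereas na.rm = TRUE only drops the missing values of the column being averaged: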
ta1 <- tapply(airquality$Ozone, airquality$Month, mean, na.rm = TRUE)
ta2 <- tapply(airquality$Temp, airquality$Month, mean, na.rm = TRUE)
cbind(ta1, ta2)
ta1 ta2
5 23.61538 65.54839
6 29.44444 79.10000
7 59.11538 83.90323
8 59.96154 83.96774
9 31.44828 76.90000
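The Temp column above does not quite match the aggregate output because the two approaches treat missing values differently. As a sketch, restricting the data to rows where both Ozone and Temp are present (which is what na.omit does in the formula method) should make tapply reproduce the aggregate figures. The names cc, ta1_cc and ta2_cc are made up for this example.

```r
ag1 <- aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, mean)

# Keep only rows where both Ozone and Temp are present, as na.omit would
cc <- airquality[complete.cases(airquality[c("Ozone", "Temp")]), ]

ta1_cc <- tapply(cc$Ozone, cc$Month, mean)
ta2_cc <- tapply(cc$Temp, cc$Month, mean)
print(cbind(ta1_cc, ta2_cc))

# Both columns are now expected to agree with aggregate
print(isTRUE(all.equal(as.numeric(ta1_cc), ag1$Ozone)))
print(isTRUE(all.equal(as.numeric(ta2_cc), ag1$Temp)))
```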
With by, on the other hand, we just can't achieve that; the following call fails to return the means (most likely this is down to the supplied function, mean):
by(airquality[c("Ozone", "Temp")], airquality$Month, mean, na.rm = TRUE)
Other times the results are the same and the differences are just in the class of the object (and hence in how it is shown/printed and, for example, how to subset it):
byagg <- by(airquality[c("Ozone", "Temp")], airquality$Month, summary)
aggagg <- aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, summary)
The previous two calls achieve the same goal and results; at some point, which tool to use is just a matter of personal taste and needs. The two objects, however, have very different needs in terms of subsetting.
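To give a rough idea of what "different needs in terms of subsetting" means, the sketch below pulls pieces out of both objects. It relies on aggregate storing multi-valued results as matrix columns inside the data frame, while by keeps one summary table per group.

```r
byagg  <- by(airquality[c("Ozone", "Temp")], airquality$Month, summary)
aggagg <- aggregate(cbind(Ozone, Temp) ~ Month, data = airquality, summary)

# by: a list-like object holding one summary table per month
print(class(byagg[[1]]))

# aggregate: a data frame whose Ozone and Temp columns are themselves
# matrices, with one row per month and one column per summary statistic
print(dim(aggagg$Ozone))
print(colnames(aggagg$Ozone))
```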